Abstract

The widely-used automatic evaluation metrics cannot adequately reflect
the fluency of translations. N-gram-based metrics, such as BLEU, limit
the maximum length of matched fragments to n and cannot capture matched
fragments longer than n, so they reflect fluency only indirectly.
METEOR, which is not limited by n-grams, uses the number of matched
chunks but does not consider the length of each chunk. In this paper,
we propose an entropy-based method that reflects the fluency of
translations through the distribution of matched words. This method can
easily be combined with the widely-used automatic evaluation metrics to
improve the evaluation of fluency. Experiments show that the
sentence-level correlations of BLEU and METEOR are improved after
combining them with the entropy-based method on WMT 2010 and WMT 2012.

1 Introduction 

Automatic machine translation (MT) evaluation plays an important role
in the evolution of MT. It not only evaluates the performance of MT
systems, but also provides guidance for the improvement of MT systems
(Och, 2003).

The automatic MT evaluation metrics can be classified into three
categories according to the type of information they employ:
lexicon-based methods (Papineni et al., 2002; Snover et al., 2006;
Lavie and Agarwal, 2007; Chen and Kuhn, 2011; Chen et al., 2012),
syntax-based methods (Liu and Gildea, 2005; Owczarzak et al., 2007;
Chan and Ng, 2008; Zhu et al., 2010; Mehay and Brew, 2007) and
semantic-based methods (Lo et al., 2012). Most of the lexicon-based
metrics obtain the similarity between the reference and the hypothesis
based on n-grams, such as BLEU (Papineni et al., 2002) and NIST
(Doddington, 2002). BLEU obtains the score by a geometric mean of the
n-gram precisions and a length-based penalty. NIST is closely related
to BLEU but uses the arithmetic mean instead of the geometric mean. For
these metrics, the maximum length of matched fragments is limited to n,
so they cannot capture matched fragments longer than n. Some metrics
that are not limited by n-grams relieve this problem, such as METEOR
(Lavie and Agarwal, 2007). METEOR uses the Fmean of unigrams and a
penalty. The penalty in METEOR is related to the number of matched
chunks¹. When two sentences have the same number of chunks, METEOR does
not distinguish between them. The syntax-based metrics obtain the
similarity by comparing the syntactic structures of two trees, and they
cannot reflect fluency directly. Semantic-based metrics, such as MEANT
(Lo et al., 2012), which uses semantic role labeling (SRL) to match the
predicates and arguments, mainly capture semantic information and do
not consider fluency.

In this paper, we propose an entropy-based method which can not only
exploit the chunks with the maximum matched length but also reflect the
difference between the lengths of the chunks. This method can easily be
combined with the widely-used automatic evaluation metrics to improve
the evaluation of fluency. In the experiments, the new method is
combined with BLEU and METEOR, and the sentence-level correlations of
BLEU and METEOR are improved on WMT 2010 and WMT 2012.

¹ The words in each chunk are in adjacent positions in the hypothesis,
and are also mapped to unigrams that are in adjacent positions in the
reference.

2 Entropy-based Method 

In this section, we introduce entropy and the 
entropy-based method (ENT) which can reflect the 
fluency of translations. 

2.1 Entropy 

Entropy is a measure of the uncertainty in a random variable. Shannon
defined the entropy H of a discrete random variable X with possible
values $x_1, x_2, \ldots, x_n$ as in Formula (1) (Shannon, 2001).

$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i) \qquad (1)$$

$P(x_i)$ is the probability of $x_i$ showing up in the stream of
characters. The more dispersed the distribution over the values
$x_1, x_2, \ldots$, the higher the entropy H(X). So the entropy can
reflect the distribution of the values of the variable X.
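
As a small illustration of Formula (1), the following Python sketch estimates the entropy of a sequence from its empirical symbol frequencies. The function name `shannon_entropy` and the toy inputs are illustrative assumptions, not part of the original method.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Entropy (in bits) of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = len(symbols)
    # H(X) = -sum_i P(x_i) * log2 P(x_i), with P estimated by relative frequency.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Two equally likely values give 1 bit; four equally likely values give 2 bits.
print(shannon_entropy("aabb"))  # 1.0
print(shannon_entropy("abcd"))  # 2.0
```

The more evenly the probability mass is spread over many values, the larger the result, which is the property the entropy-based method relies on.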

2.2 Entropy-based Method 

In the automatic evaluation of machine translation, entropy can reflect
the distribution of matched words. A lower entropy corresponds to a
more concentrated distribution of matched words, which represents a
more fluent hypothesis. Conversely, a higher entropy corresponds to a
more dispersed distribution of matched words, which represents a less
fluent hypothesis. So the entropy-based method can reflect the fluency
of translations through the distribution of the matched words.

An example (a reference and three hypotheses) 
is shown as follows. 

• ref: There are books on the desk

• hyp1: [There are books] in that [desk]

• hyp2: [There are] table [on the] book

• hyp3: [There are] table [on] book [the]

The matched words are shown in square brackets, one pair of brackets
per chunk. hyp1, hyp2 and hyp3 all match four words, but the
distributions of the four words are different. The matched words are in
two chunks for hyp1 and hyp2, and three chunks for hyp3. A smaller
number of chunks represents a more concentrated distribution of the
matched words, and corresponds to a more fluent hypothesis. From this
point of view, hyp1 and hyp2 are better than hyp3. hyp1 has the same
number of chunks as hyp2, but the numbers of matched words in the two
chunks are (3, 1) for hyp1 and (2, 2) for hyp2. hyp1 is considered to
be more fluent than hyp2.

The details of ENT are presented in the following three steps. First,
we obtain the matched words through the alignment of reference and
hypothesis. The alignment is derived using the Meteor Aligner². The
matched words are considered to be in a chunk if they are contiguous
and appear in the same order in both reference and hypothesis. Second,
the entropy of chunks is calculated using Formula (2).

$$H = -\sum_{i=1}^{c} \frac{l_i}{L} \log \frac{l_i}{L} \qquad (2)$$

$l_i$ is the length of the $i$th chunk, $c$ is the number of chunks,
and $L$ is the total number of matched words. In the last step, the
final score of ENT is obtained by Formula (3). To obtain a score within
the range (0, 1), an exponential function is used. We use $-H$ instead
of $H$ in the formula to ensure that a higher score of ENT represents a
more fluent translation.

$$ENT = \alpha^{-H} \times LP, \quad \alpha \in (1, 1.5) \qquad (3)$$

LP, a length penalty, is calculated by Formula (4). $l_h$ is the length
of the hypothesis and $l_r$ is the length of the reference.

h 

LP = plr , /3e(l,2) (4) 
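
The first step above (grouping matched words into chunks) can be sketched in Python as follows. The paper obtains the alignment from the Meteor Aligner; the greedy exact-match aligner below is only a simplified stand-in assumed for illustration, and both function names are hypothetical.

```python
def exact_match_alignment(hyp_tokens, ref_tokens):
    """Greedy stand-in aligner: map each hypothesis word to the first
    unused reference position holding the identical word."""
    used = set()
    alignment = []
    for h, word in enumerate(hyp_tokens):
        for r, ref_word in enumerate(ref_tokens):
            if r not in used and word == ref_word:
                used.add(r)
                alignment.append((h, r))
                break
    return alignment


def chunk_lengths(alignment):
    """Group matched words into chunks: a chunk continues while both the
    hypothesis position and the reference position advance by exactly one."""
    lengths = []
    prev = None
    for hyp_i, ref_i in sorted(alignment):
        if prev is not None and hyp_i == prev[0] + 1 and ref_i == prev[1] + 1:
            lengths[-1] += 1
        else:
            lengths.append(1)
        prev = (hyp_i, ref_i)
    return lengths


ref = "There are books on the desk".split()
for hyp in ("There are books in that desk",
            "There are table on the book",
            "There are table on book the"):
    print(chunk_lengths(exact_match_alignment(hyp.split(), ref)))
# [3, 1]
# [2, 2]
# [2, 1, 1]
```

On the running example this yields the chunk lengths used in the discussion of hyp1, hyp2 and hyp3.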

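Using Formula (3), the scores in the above example can be obtained as
follows.

$$LP_{hyp1} = LP_{hyp2} = LP_{hyp3} = 1$$

$$ENT_{hyp1} = \alpha^{-(-(\frac{3}{4}\log\frac{3}{4} + \frac{1}{4}\log\frac{1}{4}))} \times 1 = \alpha^{-0.24}$$

$$ENT_{hyp2} = \alpha^{-(-(\frac{2}{4}\log\frac{2}{4} + \frac{2}{4}\log\frac{2}{4}))} \times 1 = \alpha^{-0.30}$$

$$ENT_{hyp3} = \alpha^{-(-(\frac{2}{4}\log\frac{2}{4} + 2 \times \frac{1}{4}\log\frac{1}{4}))} \times 1 = \alpha^{-0.45}$$

We can see that $ENT_{hyp1} > ENT_{hyp2} > ENT_{hyp3}$. Accordingly,
the quality of hyp1 is better than that of hyp2, and hyp2 is better
than hyp3. So the entropy-based method can distinguish these situations
well.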
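
The worked example can be reproduced with a few lines of Python. This is a sketch under two assumptions: the logarithm is taken to base 10, which is the base that reproduces the exponents 0.30 and 0.45 above, and the length penalty is passed in as a precomputed value rather than derived from Formula (4). The function names are illustrative.

```python
import math


def chunk_entropy(chunk_lengths):
    """Formula (2): entropy of the chunk-length distribution (base-10 log assumed)."""
    total = sum(chunk_lengths)
    return -sum((length / total) * math.log10(length / total)
                for length in chunk_lengths)


def ent_score(chunk_lengths, alpha, length_penalty=1.0):
    """Formula (3): ENT = alpha^(-H) * LP, with LP supplied by the caller."""
    return alpha ** (-chunk_entropy(chunk_lengths)) * length_penalty


for name, lengths in [("hyp1", [3, 1]), ("hyp2", [2, 2]), ("hyp3", [2, 1, 1])]:
    print(name, round(chunk_entropy(lengths), 2))
# hyp1 0.24
# hyp2 0.3
# hyp3 0.45
```

For any α greater than 1, a smaller entropy gives a larger `ent_score`, so the three hypotheses are ranked hyp1, hyp2, hyp3.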

² http://www.cs.cmu.edu/~alavie/METEOR/






For ENT, the alignment of reference and hypothesis is derived using
exact match only. We can also use linguistic resources to obtain the
alignment, such as stems (Porter, 2001), synonyms (WordNet³) and
paraphrases. In this case, we name the new method ENTplus (ENTp).


3 Combine Entropy-based Method with 
Other Metrics 

The new entropy-based method can effectively measure the fluency of a
sentence. Most of the current metrics are good at measuring accuracy,
so we combine the entropy-based method with the widely-used automatic
evaluation metrics to further improve the performance of these metrics.
In this section, we take BLEU and METEOR as examples to introduce the
combination method, but the entropy-based method can be combined with
most of the widely-used evaluation metrics.
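
As a rough sketch of the two combinations described in this section (Formulas 7 and 10), the following Python code multiplies an existing sentence-level score by the entropy term. It assumes that the BLEU score, the METEOR Fmean, the chunk lengths and the ENT length penalty have already been computed elsewhere, and that the chunk entropy uses a base-10 logarithm. The function names are hypothetical; the default α values are the empirical ones given in Sections 3.1 and 3.2.

```python
import math


def chunk_entropy(chunk_lengths):
    """Entropy of the chunk-length distribution (base-10 log assumed)."""
    total = sum(chunk_lengths)
    return -sum((length / total) * math.log10(length / total)
                for length in chunk_lengths)


def bleu_ent(bleu, chunk_lengths, alpha=1.05):
    """BLEU (already including its brevity penalty) times alpha^(-H).
    The ENT length penalty is dropped because BLEU has its own."""
    return bleu * alpha ** (-chunk_entropy(chunk_lengths))


def meteor_ent(fmean, chunk_lengths, length_penalty, alpha=1.5):
    """Fmean times ENT, where ENT = alpha^(-H) * LP replaces (1 - Pen)."""
    return fmean * alpha ** (-chunk_entropy(chunk_lengths)) * length_penalty


# With the same base score, the more concentrated match pattern scores higher.
print(bleu_ent(0.30, [3, 1]) > bleu_ent(0.30, [2, 2]))  # True
```

Keeping the base metric and the entropy term as separate factors makes it straightforward to apply the same idea to other sentence-level metrics.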

3.1 Combine Entropy-based Method with 
BLEU 

BLEU is a widely-used automatic evaluation metric owing to its
simplicity and effectiveness. BLEU is calculated by Formula (5)
(Papineni et al., 2002).

$$BLEU = \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big) \times BP \qquad (5)$$

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases} \qquad (6)$$

In Formula (5), the first part is a geometric mean of the n-gram
precisions, where $p_n$ is the precision of n-grams, and the second
part is a length-based penalty as shown in Formula (6). There is also a
length penalty in ENT, so we remove the length penalty of ENT when
combining ENT with BLEU (Formula 7). The empirical value of α is 1.05.

$$BLEU_{ent} = BP \times \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big) \times \alpha^{-H} \qquad (7)$$

3.2 Combine Entropy-based Method with METEOR

METEOR is calculated by Formula (8), in which Pen is calculated by
Formula (9).

$$METEOR = F_{mean} \times (1 - Pen) \qquad (8)$$

$$Pen = x_1 \left( \frac{\#chunks}{\#unigrams\_matched} \right)^{x_2} \qquad (9)$$

The first part in Formula (8) is the Fmean of unigrams. The second part
is related to the number of chunks. METEOR does not consider the length
of each chunk, so it cannot distinguish two hypotheses that have the
same number of matched unigrams and the same number of chunks but
different lengths for each chunk. We use ENT instead of 1 − Pen to
reflect this situation, and the final score is computed by Formula
(10). The empirical values of α and β are 1.5 and 1.12 respectively.

$$METEOR_{ent} = F_{mean} \times ENT \qquad (10)$$

4 Experiments

To compute the correlation with human judgments on the sentence level,
Kendall's rank correlation coefficient τ is employed. A higher value of
τ means a better ranking similarity with the human judgments. τ is
calculated as follows.


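$$\tau = \frac{\mathrm{count}_{\text{con pairs}} - \mathrm{count}_{\text{dis pairs}}}{\mathrm{count}_{\text{total pairs}}}$$

$\mathrm{count}_{\text{con pairs}}$ is the count of concordant pairs.
$\mathrm{count}_{\text{dis pairs}}$ is the count of discordant pairs.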
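
A minimal Python sketch of this coefficient is given below. It assumes that each human judgment is available as a pair of metric scores (the score of the translation the judge preferred, followed by the score of the other translation), and that a metric tie is counted as discordant; both the input layout and the tie handling are assumptions made for illustration.

```python
def kendall_tau(pairs):
    """pairs: list of (metric score of the human-preferred translation,
    metric score of the other translation)."""
    total = len(pairs)
    # A pair is concordant when the metric agrees with the human preference.
    concordant = sum(1 for preferred, other in pairs if preferred > other)
    # Assumption: everything else, including ties, counts as discordant.
    discordant = total - concordant
    return (concordant - discordant) / total


judgments = [(0.6, 0.4), (0.3, 0.5), (0.7, 0.2), (0.9, 0.1)]
print(kendall_tau(judgments))  # 0.5
```

Three of the four pairs are concordant and one is discordant, so τ = (3 − 1) / 4 = 0.5.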

4.1 Data 

In order to verify the effectiveness of ENT, we carry out the
experiments on WMT 2010 and WMT 2012. There are four language pairs,
German-to-English (de-en), Czech-to-English (cz-en), French-to-English
(fr-en) and Spanish-to-English (es-en), which are all derived from WMT
2010 with 2034 sentences and WMT 2012 with 3003 sentences. The number
of translation systems for each language pair is shown in Table 1.


³ http://wordnet.princeton.edu/









Data    Metrics      cz-en    de-en    es-en    fr-en    ave
WMT10   BLEU         0.2554   0.2748   0.2805   0.2197   0.2576
        BLEU+ENT     0.2565   0.2730   0.2822   0.2211   0.2582
        BLEU+ENTp    0.2643   0.2823   0.3010   0.2368   0.2711 (+1.35)
WMT12   BLEU         0.1567   0.1840   0.1938   0.1999   0.1836
        BLEU+ENT     0.1660   0.1907   0.1940   0.2060   0.1892
        BLEU+ENTp    0.1732   0.1989   0.2052   0.2208   0.1995 (+1.59)

Table 2: Sentence level correlations of BLEU, BLEU+ENT and BLEU+ENTp on WMT 2010 and WMT
2012. The last column gives the average scores of the four language pairs.


Data    Metrics        cz-en    de-en    es-en    fr-en    ave
WMT10   METEOR         0.3292   0.3585   0.3283   0.2710   0.3218
        METEOR+ENTp    0.3354   0.3593   0.3586   0.2923   0.3364 (+1.46)
WMT12   METEOR         0.2124   0.2748   0.2493   0.2506   0.2468
        METEOR+ENTp    0.2153   0.2730   0.2585   0.2539   0.2502 (+0.34)

Table 3: Sentence level correlations of METEOR and METEOR+ENTp on WMT 2010 and WMT 2012.
The last column gives the average scores of the four language pairs.


data       cz-en   de-en   es-en   fr-en
WMT2010    12      25      15      24
WMT2012    6       16      12      15

Table 1: The number of translation systems for each language pair on
WMT 2010 and WMT 2012.

4.2 Experiment Results 

The correlations of BLEU are the results of 4-gram BLEU with the
smoothing option.⁴ According to the different methods of obtaining the
chunks, we try two configurations, BLEU+ENT and BLEU+ENTp. BLEU+ENT
uses exact match only. BLEU+ENTp additionally uses stems, synonyms and
paraphrases. The correlations of METEOR are obtained from the released
data of WMT 2010 (Version 1.27) and WMT 2012 (Version 1.48) with the
task option rank. We only run the experiment using outside resources
(METEOR+ENTp), because METEOR also uses the outside resources.⁵

The sentence-level correlations of the four language pairs and the
average scores are shown in Table 2 and Table 3. In Table 2, BLEU+ENT
is better than BLEU on average on both WMT10 and WMT12, but the result
is only slightly improved compared with BLEU. The reason is that the
alignment is not good enough when only exact match is used. When using
stems, synonyms and paraphrases, the result improves significantly, by
1.35 on WMT 2010 and 1.59 on WMT 2012 respectively, compared with BLEU.
The number of references is limited, and linguistic resources can
enrich the reference, so ENTp can achieve better performance than ENT.

⁴ ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v13a.pl

⁵ Interested readers can find the source code of ENT and ENTp at
https://github.com/YuHui0117/AMTE/tree/master/ENTFp.
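
The bracketed gains in Table 2 are differences between average correlations, multiplied by 100. The short Python check below recomputes them from the per-language-pair values in the table; the helper name `gain_in_points` is an illustrative assumption.

```python
from statistics import mean

# Per-language-pair correlations copied from Table 2 (cz-en, de-en, es-en, fr-en).
bleu_wmt10 = [0.2554, 0.2748, 0.2805, 0.2197]
bleu_entp_wmt10 = [0.2643, 0.2823, 0.3010, 0.2368]
bleu_wmt12 = [0.1567, 0.1840, 0.1938, 0.1999]
bleu_entp_wmt12 = [0.1732, 0.1989, 0.2052, 0.2208]


def gain_in_points(baseline, combined):
    """Difference of the average correlations, expressed in points (x100)."""
    return round((mean(combined) - mean(baseline)) * 100, 2)


print(gain_in_points(bleu_wmt10, bleu_entp_wmt10))  # 1.35
print(gain_in_points(bleu_wmt12, bleu_entp_wmt12))  # 1.59
```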

From Table 3 we can see that METEOR+ENTp has a significant improvement
(1.46 on average) on WMT 2010, while the improvement on WMT 2012 (0.34
on average) is smaller. The METEOR version used on WMT 2012 optimizes
its parameters on the data of WMT 2009 and WMT 2010. We did not tune
the parameters after combining METEOR with the entropy-based method, so
the improvement is not very significant.

In all, when the entropy-based penalty is combined with the widely-used
automatic evaluation metrics, such as BLEU and METEOR, the performance
is improved, which demonstrates the effectiveness of the entropy-based
method.

5 Conclusion and Future Work 

In this paper, we use entropy to reflect the fluency of the
translation, and propose an entropy-based method. When the
entropy-based method is combined with the widely-used automatic
evaluation metrics, such as BLEU and METEOR, the performance of these
metrics is improved.


































One purpose of automatic evaluation metrics is to improve the quality
of machine translation systems. So, in the future, we will use the
combination of the entropy-based method and widely-used metrics in the
tuning process, such as MERT (Minimum Error Rate Training) (Och, 2003),
to improve the translation quality.